Determining an Optimal Set of Flesh Points on Tongue, Lips, and Jaw for Continuous Silent Speech Recognition
Abstract
Articulatory data have gained increasing interest in speech recognition, with or without acoustic data. The electromagnetic articulograph (EMA) is an affordable, currently used device for tracking the movement of flesh points on articulators (e.g., the tongue) during speech. Determining an optimal set of sensors is important for the clinical application of EMA, because attaching sensors to the tongue and other intraoral articulators is inconvenient, particularly for patients with neurological diseases. A recent study identified an optimal four-sensor set (tongue tip, tongue body back, upper lip, and lower lip) for classifying isolated phonemes, words, or short phrases from articulatory movement data. This four-sensor set, however, has not been verified for continuous silent speech recognition. In this paper, we investigated continuous speech recognition from different sensor combinations to verify that finding, using the publicly available MOCHA-TIMIT data set. The long-standing Gaussian mixture model-hidden Markov model (GMM-HMM) approach and the more recent deep neural network-HMM (DNN-HMM) approach were used as recognizers. Experimental results confirmed that the four-sensor set is optimal among the full set of sensors on the tongue, lips, and jaw. Adding upper incisor and/or velum data further improved recognition performance slightly.
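A minimal sketch of the sensor-subset evaluation idea is given below. It is not the authors' code: the loader `load_mocha_frames` and the channel layout are hypothetical, and a per-phone Gaussian mixture frame classifier stands in for the full GMM-HMM and DNN-HMM recognizers used in the paper.

```python
# Sketch only: rank EMA sensor subsets by frame-level phone accuracy.
# Assumptions (not from the paper): `load_mocha_frames` and the channel
# layout below are hypothetical; a per-phone diagonal-covariance GMM
# frame classifier replaces the full GMM-HMM / DNN-HMM recognizers.
import itertools
import numpy as np
from sklearn.mixture import GaussianMixture

# MOCHA-TIMIT-style sensor names mapped to (x, y) column indices (assumed layout).
SENSORS = {
    "TT": (0, 1),    # tongue tip
    "TB": (2, 3),    # tongue body
    "TD": (4, 5),    # tongue dorsum (back)
    "UL": (6, 7),    # upper lip
    "LL": (8, 9),    # lower lip
    "LI": (10, 11),  # lower incisor (jaw)
    "UI": (12, 13),  # upper incisor
    "V":  (14, 15),  # velum
}

def subset_accuracy(X_train, y_train, X_test, y_test, sensors):
    """Frame-level phone accuracy using only the chosen sensors' channels."""
    cols = [c for s in sensors for c in SENSORS[s]]
    phones = np.unique(y_train)
    models = [GaussianMixture(n_components=4, covariance_type="diag")
              .fit(X_train[y_train == p][:, cols]) for p in phones]
    # Log-likelihood of each test frame under each phone model.
    loglik = np.stack([m.score_samples(X_test[:, cols]) for m in models], axis=1)
    return (phones[loglik.argmax(axis=1)] == y_test).mean()

# Rank all four-sensor combinations drawn from tongue, lips, and jaw.
# X_train, y_train, X_test, y_test = load_mocha_frames()   # hypothetical loader
# candidates = [s for s in SENSORS if s not in ("UI", "V")]
# scores = {c: subset_accuracy(X_train, y_train, X_test, y_test, c)
#           for c in itertools.combinations(candidates, 4)}
# print(max(scores, key=scores.get))  # paper reports tongue tip/body back + both lips
```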
Similar Articles
Towards a segmental vocoder driven by ultrasound and optical images of the tongue and lips
This article presents a framework for a phonetic vocoder driven by ultrasound and optical images of the tongue and lips for a “silent speech interface” application. The system is built around an HMM-based visual phone recognition step which provides target phonetic sequences from a continuous visual observation stream. The phonetic target constrains the search for the optimal sequence of diphon...
Phone recognition from ultrasound and optical video sequences for a silent speech interface
Latest results on continuous speech phone recognition from video observations of the tongue and lips are described in the context of an ultrasound-based silent speech interface. The study is based on a new 61-minute audiovisual database containing ultrasound sequences of the tongue as well as both frontal and lateral views of the speaker’s lips. Phonetically balanced and exhibiting good diphone ...
Development of a silent speech interface driven by ultrasound and optical images of the tongue and lips
This article presents a segmental vocoder driven by ultrasound and optical images (standard CCD camera) of the tongue and lips for a “silent speech interface” application, usable either by a laryngectomized patient or for silent communication. The system is built around an audio–visual dictionary which associates visual to acoustic observations for each phonetic class. Visual features are extra...
A Visual Speech Recognition System for an Ultrasound-based Silent Speech Interface
The development of a continuous visual speech recognizer for a silent speech interface has been investigated using a visual speech corpus of ultrasound and video images of the tongue and lips. By using high-speed visual data and tied-state cross-word triphone HMMs, and including syntactic information via domain-specific language models, word-level recognition accuracy as high as 72% was achieve...
Continuous-speech phone recognition from ultrasound and optical images of the tongue and lips
The article describes a video-only speech recognition system for a “silent speech interface” application, using ultrasound and optical images of the voice organ. A one-hour audiovisual speech corpus was phonetically labeled using an automatic speech alignment procedure and robust visual feature extraction techniques. HMM-based stochastic models were estimated separately on the visual and acoust...